Sequencing and Raw Sequence Data Quality Control ◾ 13
for good quality in high-throughput sequencing (HTS). Table 1.2 lists selected Q scores
with the corresponding error probability, base call accuracy, and interpretation.
1.4 FASTQ FILES
Sequencing platforms such as Illumina ship with Real-Time Analysis (RTA) software
that stores individual base call data in intermediate files called BCL files. When
the sequencing run completes, these BCL files are filtered, demultiplexed if the samples were
multiplexed, and then converted into a sequence file format called FASTQ. A single-end run
produces one FASTQ file per sample, whereas a paired-end run produces two FASTQ files
(R1 and R2) per sample: the R1 file holds the forward reads and the R2 file the reverse reads.
FASTQ files are usually compressed, in which case they have the file extension “*.fastq.gz”.
FASTQ [7] is a human-readable file format that has become the de facto standard for
storing the output of most HTS technologies. A FASTQ file consists of a number of records,
with each record having four lines of data as shown in Figure 1.6.
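The four-line record structure can be sketched in a few lines of Python. This is a minimal illustration rather than a production parser: it assumes every record occupies exactly four lines (no wrapped sequences), and the names `FastqRecord`, `read_fastq`, and `open_fastq` are hypothetical helpers introduced here, not part of any library.

```python
import gzip
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # line 1 without the leading "@"
    sequence: str    # line 2: the base calls
    separator: str   # line 3: starts with "+"
    quality: str     # line 4: one quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from an open text handle, four lines at a time."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records or at the end
        header = header.rstrip("\r\n")
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near: {header!r}")
        yield FastqRecord(header[1:], sequence, separator, quality)


def open_fastq(path: str) -> TextIO:
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


# Made-up two-record example, not data from a real run.
example = io.StringIO(
    "@read1\nACGTN\n+\nIIII#\n"
    "@read2\nGGCA\n+\nFFFF\n"
)
records = list(read_fastq(example))
print(len(records), records[0].sequence)
```

For a compressed file, the same generator could be used as `read_fastq(open_fastq("sample_R1.fastq.gz"))`, where the file name is only a placeholder.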
The first line of each record of a FASTQ file begins with the “@” symbol. This line is
called the read identifier because it identifies the sequence (read). A typical FASTQ identifier
line for reads generated by an Illumina instrument looks as follows:
@<instrument>:<run num>:<flowcell ID>:<lane>:<tile>:<x>:<y>:<UMI> <read>:<filtered>:<control num>:<index>
Table 1.3 describes the elements of the Illumina FASTQ identifier line, and Figure 1.6
shows an example FASTQ file with three read records. The index sequence observed for the
read (the index is part of the adaptor) is written to the FASTQ header in place of the sample
number. This information can be useful for troubleshooting and demultiplexing. However,
these metadata elements may be changed or replaced by other elements, especially when the
files are submitted to a database or edited by users.
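To make the identifier layout concrete, the following sketch splits an Illumina-style identifier line into named fields. It rests on two assumptions that should be checked against the files at hand: the two groups of colon-separated fields are separated by a single space, and the UMI field may be absent. The function name `parse_illumina_identifier` and the example identifiers are invented for illustration.

```python
from typing import Dict


def parse_illumina_identifier(line: str) -> Dict[str, str]:
    """Split an Illumina-style identifier line into named fields."""
    if not line.startswith("@"):
        raise ValueError("identifier line must start with '@'")
    # Assumption: the two field groups are separated by one space.
    left, _, right = line[1:].strip().partition(" ")

    left_names = ["instrument", "run_num", "flowcell_id", "lane", "tile", "x", "y"]
    left_values = left.split(":")
    # Assumption: the UMI field is optional (8 fields with it, 7 without).
    if len(left_values) == 8:
        left_names.append("umi")
    elif len(left_values) != 7:
        raise ValueError(f"expected 7 or 8 fields, got {len(left_values)}")
    fields = dict(zip(left_names, left_values))

    right_names = ["read", "filtered", "control_num", "index"]
    right_values = right.split(":")
    if len(right_values) != 4:
        raise ValueError(f"expected 4 fields after the space, got {len(right_values)}")
    fields.update(zip(right_names, right_values))
    return fields


# Made-up identifier for illustration; not taken from a real run.
example = "@SIM:1:FCX:1:15:6329:1045 1:N:0:ATCCGA"
fields = parse_illumina_identifier(example)
print(fields["lane"], fields["read"], fields["index"])
```

Because identifiers are often rewritten when files are deposited in a database, a strict parser like this one raises an error on such headers rather than guessing at their meaning.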
The second line of each record contains the bases inferred by the sequencer. The
bases are A, C, G, and T for adenine, cytosine, guanine, and thymine, respectively.
The character N may appear at a position where the base is ambiguous (that is, it was not
determined because of a sequencing fault).
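A quick way to inspect the sequence line is to count each base and the share of ambiguous N calls. The sketch below uses only the Python standard library; the helper names and the example read are hypothetical.

```python
from collections import Counter


def base_composition(sequence: str) -> Counter:
    """Count how often each character occurs in a read."""
    return Counter(sequence.upper())


def n_fraction(sequence: str) -> float:
    """Fraction of positions called as N (0.0 for an empty read)."""
    if not sequence:
        return 0.0
    return sequence.upper().count("N") / len(sequence)


def unexpected_characters(sequence: str) -> set:
    """Characters other than A, C, G, T, and N."""
    return set(sequence.upper()) - set("ACGTN")


# Made-up read for illustration.
read = "ACGTNNACGT"
counts = base_composition(read)
print(counts["A"], counts["N"], n_fraction(read))
```

Reads with a high N fraction are common candidates for filtering, although the threshold is a choice that depends on the application.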
The third line starts with a plus sign “+”, and it may optionally repeat the identifier
line elements or carry additional metadata.
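The three possibilities for the third line (a bare plus sign, a repeat of the identifier, or other metadata) can be told apart with a small check. The function name and return labels below are hypothetical and serve only as an illustration.

```python
def classify_separator(identifier_line: str, separator_line: str) -> str:
    """Describe what line 3 of a record carries after the '+' sign."""
    if not identifier_line.startswith("@"):
        raise ValueError("line 1 of a FASTQ record must start with '@'")
    if not separator_line.startswith("+"):
        raise ValueError("line 3 of a FASTQ record must start with '+'")
    extra = separator_line[1:].strip()
    if extra == "":
        return "bare"
    if extra == identifier_line[1:].strip():
        return "repeats_identifier"
    return "other_metadata"


# Made-up lines for illustration.
print(classify_separator("@read1", "+"))
print(classify_separator("@read1", "+read1"))
```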
TABLE 1.2 Phred Quality Score, Error Probability, and Base Call Accuracy
Q     Error Probability    Base Call Accuracy (%)    Interpretation
10    0.1                  90                        1 error in 10 calls
20    0.01                 99                        1 error in 100 calls
30    0.001                99.9                      1 error in 1,000 calls
40    0.0001               99.99                     1 error in 10,000 calls
50    0.00001              99.999                    1 error in 100,000 calls
60    0.000001             99.9999                   1 error in 1,000,000 calls
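The rows of Table 1.2 follow from the standard Phred relation Q = -10 log10(P), where P is the probability that a base call is wrong. The sketch below assumes that definition; the function names are chosen here for illustration.

```python
import math


def error_probability(q: float) -> float:
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def base_call_accuracy(q: float) -> float:
    """Base call accuracy in percent for a Phred score Q."""
    return 100 * (1 - error_probability(q))


def phred_score(p: float) -> float:
    """Phred score Q for an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)


# Recompute the Q values listed in Table 1.2.
for q in (10, 20, 30, 40, 50, 60):
    print(q, error_probability(q), round(base_call_accuracy(q), 4))
```

Because the scale is logarithmic, every increase of 10 in Q corresponds to a tenfold drop in the error probability.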